Georeferencing Annual Household Surveys in Rural Madagascar: A Comprehensive Analysis of Observatories (1995-2014)
Author
Florent Bédécarrats
Introduction
Background of the Study
Objective of the Georeferencing
A notebook approach based on R and Quarto
library(tidyverse) # A series of packages for data manipulation
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)    # Required for reading STATA files (.dta)
library(labelled) # To work with labelled data from STATA
library(sf)       # For spatial data handling
Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidyverse)  # For data wrangling and visualization
library(stringdist) # For string distance and matching
Attaching package: 'stringdist'
The following object is masked from 'package:tidyr':
extract
library(tmap) # for mapping
library(fuzzyjoin) # for fuzzy joining
library(readxl) # Read data frames from Excel files
library(writexl) # Write data frames to Excel files
library(gt)      # Produce formatted tables
Revisiting the Rural Observatories (ROR) of Madagascar
Historical Background
Significance in Rural Development and Policy Decisions
Challenges for opening the data
In this notebook, our primary objective is to enhance the georeferencing of the ROR survey data for open data sharing. The initial ROR survey, launched in 1995, recorded geographical information in varying formats: from “village” alone to a combination of “municipality”, “village”, and “site”. A significant challenge arises from the fact that while data collection started in 1995, municipalities were only formally established in 1994, and several years were required for them to stabilize. The inherent fluidity of toponyms, predominantly derived from oral traditions, resulted in varied written representations. Our endeavor is to identify, disambiguate, and georeference the observations recorded in the ROR data, adopting as a reference the Common Operational Datasets (CODs), which have been collaboratively defined by OCHA and Madagascar’s BNGRC (National Disaster Management Office).
CODs stand as the bedrock for all preparedness and response operations, especially within the humanitarian sector. Adopted by the IASC in 2008 and revised in 2010, these datasets are pivotal for facilitating informed decision-making during the critical initial hours of a crisis. By ensuring consistency among stakeholders, they simplify data management and establish a shared operational picture of a crisis. Particularly relevant for our purpose is the incorporation of P-codes in CODs. These unique geographic identification codes, found in both Administrative Boundary CODs (COD-ABs) and Population Statistics CODs (COD-PSs), surmount challenges posed by variations in placenames and spellings. For instance, in Madagascar, 81 different administrative level 4 (ADM4) features are labeled “Morafeno”, with six of these existing within ADM3 features also termed “Morafeno”, distinguishable solely by their unique ADM2 features.
P-codes act as reliable geographic identifiers, eliminating errors arising from identical or variably spelled geographic locations. Leveraging the HDX platform, an open platform for cross-crisis data sharing, we fetch this data to ensure the accurate georeferencing of our ROR data. By harnessing the standardized and official spelling of places provided by P-codes, we can amalgamate, harmonize, and analyze data from diverse sources, offering a comprehensive, geo-accurate view of the survey’s findings.
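To illustrate the ambiguity that P-codes resolve, consider a minimal sketch with a toy gazetteer. The column names (ADM4_EN, ADM3_EN, ADM4_PCODE) mirror the COD-AB convention, but the rows and codes are invented for illustration:

```r
library(dplyr)
library(tibble)

# Toy gazetteer: three distinct localities share the name "Morafeno",
# but each carries a unique P-code (codes invented for illustration)
gazetteer <- tribble(
  ~ADM4_EN,   ~ADM3_EN,     ~ADM4_PCODE,
  "Morafeno", "Morafeno",   "MG11101001",
  "Morafeno", "Ambohibary", "MG22202002",
  "Morafeno", "Morafeno",   "MG33303003"
)

# The name alone is ambiguous...
ambiguous <- gazetteer %>% count(ADM4_EN) %>% filter(n > 1)

# ...whereas every P-code identifies exactly one unit
n_distinct(gazetteer$ADM4_PCODE)
```

Joining survey records on P-codes rather than on names avoids both the homonym problem and the spelling-variation problem at once.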
Data Description and Initial Exploration with R
ROR survey data
The ROR survey data is organized in a collection of year-specific folders ranging from 1995 to 2015. Each yearly folder houses multiple .dta files (Stata data format) – about 85 per year – with diverse filenames such as “res_as.dta” and “res_bp.dta”. Although some harmonization has been achieved, especially for certain variables and household identifiers, the data varies in geographical granularity. The initial years primarily provide a single field denoting the village name. As the years progress, this evolves to include a municipality name, and in the later years an additional “site” name occasionally appears. A comprehensive overview of the observations can be gleaned from the “res_deb.dta” files within each year.
# Function to extract variable info for a given year and file
extract_variable_info <- function(year, file) {
  file_path <- paste0("Données ROR/enter/", year, "/", file)
  if (!file.exists(file_path)) return(tibble())
  data <- read_dta(file_path, n_max = 0)
  tibble(
    file_name = file,
    variable_name = names(data),
    variable_label = var_label(data) %>% as.character(),
    year = year
  )
}

# Obtain all years from the directory structure
years <- list.dirs("Données ROR/enter/", recursive = FALSE, full.names = FALSE)

# Use the tidyverse approach to map over years and files
all_vars <- map_df(years, ~{
  files_for_year <- list.files(paste0("Données ROR/enter/", .x),
                               pattern = "\\.dta$", full.names = FALSE)
  map_df(files_for_year, extract_variable_info, year = .x)
})

# Convert any NULL values in variable_label to "NA"
all_vars$variable_label[is.na(all_vars$variable_label)] <- "NA"

# Consolidate the information using the tidyverse approach
variable_dictionary <- all_vars %>%
  group_by(file_name, variable_name) %>%
  arrange(year) %>%
  summarise(
    variable_label = first(variable_label[variable_label != "NA"] %||% "NA"),
    years_present = list(unique(year))
  ) %>%
  ungroup() %>%
  mutate(years_present = map_chr(years_present, ~paste(.x, collapse = ",")))
`summarise()` has grouped output by 'file_name'. You can override using the
`.groups` argument.
# Write the variable dictionary to an Excel file
write_xlsx(variable_dictionary, "ROR_Variable_Dictionary.xlsx")
The code block above creates a data dictionary, which you can download by clicking on this link.
Administrative boundaries
The “Madagascar Subnational Administrative Boundaries” dataset is sourced from the Common Operational Datasets (CODs), which offer authoritative reference datasets vital for decision-making during humanitarian operations. Specifically designed to streamline the discovery and exchange of pivotal data, CODs ensure uniformity and use the ‘best available’ datasets. This particular dataset focuses on administrative boundaries, including gazetteers with P-codes, facilitating organized humanitarian assessments and data management. P-codes act as unique identifiers for every administrative unit and populated area, ensuring standardization in nomenclature. When datasets adhere to the P-code standard, their integration and analysis become efficient. The dataset provides comprehensive boundary information for Madagascar at five administrative levels: country, region, district, commune, and fokontany. It’s accessible for download as shapefiles from the provided link.
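In many COD-AB gazetteers, a child unit’s P-code extends its parent’s code as a prefix, so the parent units can be recovered with simple string operations. A minimal base-R sketch — the code below is invented for illustration, not an actual Madagascar P-code:

```r
# Hypothetical commune-level (ADM3) P-code following the prefix convention
adm3_pcode <- "MG11101001"

# Parent codes are recovered as prefixes (prefix lengths are illustrative)
adm2_pcode <- substr(adm3_pcode, 1, 7)  # district
adm1_pcode <- substr(adm3_pcode, 1, 4)  # region
```

The exact prefix lengths vary by country, so they should be checked against the gazetteer before relying on this shortcut.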
Localities
The “Madagascar Populated Places” dataset is part of the Common Operational Datasets (CODs). This dataset encompasses populated place points for Madagascar. The data has been sourced from the National Geospatial-Intelligence Agency and provided by the University of Georgia - ITOS. Further, the Geographic Information Support Team (GIST) has taken up the role of distributor, with the data being published on 2007-03-07. UN OCHA ROSA has enhanced the dataset by adding P-codes and administrative boundary names, which are based on the BNGRC (National Disaster Management Office) data. The dataset geolocates 28184 populated places with their toponyms (names), codes related to various administrative levels such as fokontany, commune, district, and region, and their spatial coordinates.
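A layer like this is loaded with sf::st_read(), as the output below shows. A self-contained sketch of the pattern, using a toy point layer round-tripped through a temporary GeoPackage instead of the actual shapefile:

```r
library(sf)

# Toy point layer standing in for the populated-places shapefile
# (names and coordinates invented for illustration)
pts <- st_as_sf(
  data.frame(PLACE_NAME = c("Morafeno", "Ambohibary"),
             lon = c(47.5, 46.9), lat = c(-19.8, -18.9)),
  coords = c("lon", "lat"), crs = 4326
)

# Round-trip through a GeoPackage in a temporary file
tmp <- tempfile(fileext = ".gpkg")
st_write(pts, tmp, quiet = TRUE)
pop_places_demo <- st_read(tmp, quiet = TRUE)
```

The actual call in this notebook points at the mdg_pplp_places_NGA_OCHA.shp shapefile shown in the output below.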
Reading layer `mdg_pplp_places_NGA_OCHA' from data source
`C:\Users\fbede\Documents\IRD_Drive\BETSAKA\Données\Entités administratives\OCHA_BNGRC populated places\mdg_pplp_places_NGA_OCHA.shp'
using driver `ESRI Shapefile'
Simple feature collection with 28184 features and 23 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 43.21667 ymin: -25.58333 xmax: 50.48333 ymax: -11.96667
Geodetic CRS: WGS 84
Methodology
Simplify strings
Letter case can be a problem, as can the fact that some attributes of the toponyms appear sometimes in Malagasy and sometimes in French.
# A function to harmonize cases and usual variation in the way localities are
# written in the survey data
clean_string <- function(x) {
  x %>%
    tolower() %>% # Convert to lowercase
    # Retain spaces, remove other non-alphanumeric characters
    str_replace_all("[^[:alnum:][:space:]]", " ") %>%
    str_remove_all("\\b(centre|haut|bas|androy)\\b") %>%
    str_trim() %>% # Trim spaces from start and end of string
    str_replace_all("\\bcentre\\b", "") %>% # Remove the word 'centre'
    # Translate cardinal points
    str_replace_all("\\bnord\\b", "avaratra") %>%
    str_replace_all("\\best\\b", "atsinanana") %>%
    str_replace_all("\\bouest\\b", "andrefana") %>%
    str_replace_all("\\bsud\\b", "atsimo") %>%
    str_replace_all("\\batsinana\\b", "atsinanana") %>% # Replace short form
    str_replace_all("(fort dauphin)|(taolagnaro)|(tolagnaro)", "tolanaro") # Variations for Fort Dauphin
}
Fuzzy matching
Fuzzy matching is a process that identifies non-exact matches of your query string/pattern. Unlike the traditional matching methods, which require an exact match, fuzzy matching captures the records that are likely to be relevant. It computes a similarity score between the strings being compared, allowing for minor discrepancies such as typos, added or missing characters, and more. This is especially useful in scenarios where data might come from various sources, with slight variations in spelling or naming conventions. A common metric for this is the “Levenshtein distance” which counts the number of edits needed to change one string into another.
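As a concrete illustration, take two spellings from the false-positives list encountered later in this notebook: they differ by a single letter, so their Levenshtein distance is 1, while the normalized Jaro-Winkler distance (method "jw", the metric used for matching in this notebook) is close to 0 for such near-identical strings:

```r
library(stringdist)

# One edit (deleting the "h") turns one spelling into the other
stringdist("ambatoharanana", "ambatoaranana", method = "lv")  # → 1

# Jaro-Winkler distance is normalized between 0 (identical strings)
# and 1 (completely different strings)
d <- stringdist("ambatoharanana", "ambatoaranana", method = "jw")
```

A normalized metric like Jaro-Winkler makes it possible to use a single threshold (here 0.25) across place names of very different lengths.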
fuzzy_match <- function(target_string, dataframe, column_name, observatory_code,
                        max_distance = 0.25) {
  # Filter the dataframe based on observatory_code
  filtered_reference <- dataframe %>%
    filter(OBS_CODE == observatory_code) %>%
    select(all_of(column_name), ADM3_PCODE, ADM3_EN)
  # If filtered_reference is empty, return NA values
  if (nrow(filtered_reference) == 0) {
    return(list(matched_string = NA, ADM3_PCODE = NA, ADM3_EN = NA, distance = 1))
  }
  # Use stringdist to find the closest match
  distances <- stringdist::stringdistmatrix(target_string,
                                            filtered_reference[[column_name]],
                                            method = "jw")
  # If there are no valid distances, set min_distance to Inf
  # (this should help avoid the error)
  if (all(is.na(distances))) {
    min_distance <- Inf
  } else {
    # Ensure NA values don't affect the min calculation
    min_distance <- min(distances, na.rm = TRUE)
  }
  # Check for Inf distance and replace it with 1
  if (is.infinite(min_distance)) {
    min_distance <- 1
  }
  # If min_distance exceeds the max_distance threshold, return NA values
  if (min_distance > max_distance) {
    return(list(matched_string = NA, ADM3_PCODE = NA, ADM3_EN = NA, distance = NA))
  }
  matched_row <- filtered_reference[which.min(distances), ]
  return(list(matched_string = matched_row[[column_name]],
              ADM3_PCODE = matched_row$ADM3_PCODE,
              ADM3_EN = matched_row$ADM3_EN,
              distance = min_distance))
}
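The core of fuzzy_match — a distance matrix plus which.min — can be seen in isolation. A self-contained sketch with an invented candidate list and a misspelled query:

```r
library(stringdist)

candidates <- c("ambohibary", "morafeno", "tolanaro")
query <- "ambohibarry"  # misspelled survey entry (invented)

# One row (the query) against all candidates, Jaro-Winkler distance
distances <- stringdistmatrix(query, candidates, method = "jw")
best <- candidates[which.min(distances)]  # → "ambohibary"
```

Restricting the candidate list to one observatory, as fuzzy_match does, keeps this nearest-neighbour search from picking homonymous municipalities elsewhere in the country.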
A Detailed Walk-through: Georeferencing in Action
The process followed depends on the completeness and quality of the geographic information obtained in the previous steps, so we proceed in successive steps, each handling the observations left unmatched by the previous one.
Method 1
Where we have municipality names in the corresponding field, we match these names against the names of the municipalities already identified as data collection sites (the matching is segmented observatory by observatory to reduce the risk of false positives). Since there may be spelling variations, we use fuzzy matching, checking the likelihood that a match is valid (and possibly enabling a visual confirmation).
# List of years
years <- 1995:2014

# Read all datasets and combine
all_surveys_description <- map_df(years, function(year) {
  df <- read_dta(paste0("Données ROR/enter/", year, "/res_deb.dta"))
  # Convert all columns to character to ensure consistency
  df <- df %>% mutate_all(as.character)
  return(df)
})

# Extract unique combinations and list all the years they appeared in
unique_combinations <- all_surveys_description %>%
  group_by(j0, j42, j4) %>%
  summarize(years = toString(year)) %>%
  ungroup()
`summarise()` has grouped output by 'j0', 'j42'. You can override using the
`.groups` argument.
# Harmonize the fields that contain municipality or village names
unique_combinations <- unique_combinations %>%
  mutate(clean_muni = clean_string(j42),
         clean_village = clean_string(j4))
obs_communes <- obs_communes %>%
  mutate(clean_ADM3 = clean_string(ADM3_EN))
pop_places <- pop_places %>%
  mutate(clean_pname = clean_string(PLACE_NAME))

# List of observatories for which municipalities have been identified
identified_observatories <- unique(obs_communes$OBS_CODE) %>% na.omit()

# Filter for the observatories for which we already have a manual
# identification of municipalities
unique_combinations <- unique_combinations %>%
  filter(j0 %in% identified_observatories)

# Apply the fuzzy matching observatory-wise
results <- map2_df(unique_combinations$clean_muni, unique_combinations$j0,
                   ~as.data.frame(t(fuzzy_match(.x, obs_communes, "clean_ADM3", .y)))) %>%
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))

# Combine the results with the unique_combinations
correspondence_table <- bind_cols(unique_combinations, results)

# Add a column for the matching method
correspondence_table <- correspondence_table %>%
  mutate(method = ifelse(!is.na(matched_string), "method_1", NA_character_))
As a result of this first matching step, we have the following number of successfully matched observations.
For observations that have no municipality name, we try parsing the strings in the village-name field to see whether they contain the name of a municipality already identified among the municipalities surveyed in that observatory.
# Filter out matched municipalities and extract the first word
unmatched_results <- correspondence_table %>%
  filter(is.na(matched_string)) %>%
  select(j0:clean_village) %>%
  mutate(first_word = str_extract(clean_village, "^[^\\s/]+"))

# Fuzzy matching with the first word and identified municipalities
results_step2 <- map2_df(unmatched_results$first_word, unmatched_results$j0,
                         ~as.data.frame(t(fuzzy_match(.x, obs_communes, "clean_ADM3", .y)))) %>%
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))

# Update results
potential_matches2 <- bind_cols(unmatched_results, results_step2)

# Manually identify false positives and remove them
false_positives <- c("madiromionga", "maroarla", "tsaratanteraka", "ambatoharanana",
                     "ambatoaranana", "maroala", "erakoja", "erakoka", "maroalo",
                     "erakka", "erakoa")
validated_matches2 <- potential_matches2 %>%
  mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),
                ~ifelse(first_word %in% false_positives, NA, .)),
         method = ifelse(!is.na(matched_string), "method_2", NA_character_))

# Integrate new results in correspondence table
correspondence_table <- correspondence_table %>%
  filter(!is.na(matched_string)) %>%
  bind_rows(validated_matches2)
# Extract matched results
matched_results <- unique(correspondence_table$ADM3_PCODE) %>% na.omit()

# Keep the municipalities among those
matched_spatial <- obs_communes %>%
  filter(ADM3_PCODE %in% matched_results)

#############################################################################
# Big problem with spatial coordinates TO SOLVE LATER
# Handle NULL values by assigning them NA
correspondence_table_updated$longitude[sapply(correspondence_table_updated$longitude, is.null)] <- NA
correspondence_table_updated$latitude[sapply(correspondence_table_updated$latitude, is.null)] <- NA

# Convert the remaining list items to numeric
correspondence_table_updated$longitude <- sapply(correspondence_table_updated$longitude, function(x) {
  if (is.list(x) && !is.null(x[[1]])) {
    return(as.numeric(x[[1]]))
  } else {
    return(as.numeric(x))
  }
})
correspondence_table_updated$latitude <- sapply(correspondence_table_updated$latitude, function(x) {
  if (is.list(x) && !is.null(x[[1]])) {
    return(as.numeric(x[[1]]))
  } else {
    return(as.numeric(x))
  }
})

# Filter rows where both longitude and latitude are not NA
correspondence_table_updated <- correspondence_table_updated[
  !is.na(correspondence_table_updated$longitude) &
  !is.na(correspondence_table_updated$latitude), ]

# Convert to sf object
correspondence_sf <- st_as_sf(correspondence_table_updated,
                              coords = c("longitude", "latitude"), crs = 4326)
#############################################################################

Vahatra <- st_read("../Aires protégées/Vahatra/Shapefiles/AP_Vahatra.shp")
Reading layer `AP_Vahatra' from data source
`C:\Users\fbede\Documents\IRD_Drive\BETSAKA\Données\Aires protégées\Vahatra\Shapefiles\AP_Vahatra.shp'
using driver `ESRI Shapefile'
Simple feature collection with 98 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 43.25742 ymin: -25.60502 xmax: 50.47724 ymax: -11.98301
Geodetic CRS: WGS 84